Acorn User: China

home *** CD-ROM | disk | FTP | other *** search

/ Acorn User: China / Acorn User China CD-ROM (UK) (Disc B) / Acorn User China CD-ROM (UK) (Disc B).bin / BARNET / PROGRAMMING / DOCS / TOPTIPS < prev next >

Wrap

Text File | 1998-06-02 | 11.8 KB | 413 lines

<html> <head> <title> Top Tips on programming in ARM Assembler </title> </head> <body> This page demonstrates some of the optimisations which are possible when programming in ARM assembler. It's based on my experience of optimising other people's code, so it's real examples of tricks that people overlook. This page assumes a familiarity with the ARM instruction set, and programming in assembler in general. <h1> Use of conditionals </h1> Conditional execution of instructions is one of the best things about the ARM instruction set, and yet people overlook it so often. For example, returning with the V bit set from a function is often used to indicate an error occurred: <pre> STMFD r13!, {r14} BL func1 LDMVSFD r13!, {pc} ADD r0, r1, r2 BL func2 LDMVSFD r13!, {pc} SUB r1, r2, r0 BL func3 LDMVSFD r13!, {pc} LDMFD r13!, {pc}^ </pre> can be rewritten as: <pre> STMFD r13!, {r14} BL func1 ADDVC r0, r1, r2 BLVC func2 SUBVC r1, r2, r0 BLVC func3 LDMVCFS r13!, {pc}^ LDMFD r13!, {pc} </pre> A saving of 8 bytes in space, and 3 instructions in execution time for the case when no errors occur - which is hopefully the more common! I also find it more readable and it's easier to trace the flow of execution. Naturally, if you have to do comparisons then this is not possible and you'll have to either exit or branch over the conditional segment of code. It can be more efficient to do this if you have a long procedure exit sequence, but normally exiting is merely a matter of unstacking the registers and writing the appropriate value to the program counter which is one instruction. <h1> Use of the correct comparison </h1> The different types of comparison available often confuse the novice programmer. Use of signed comparisons instead of unsigned is the most common error which is made, which can be merely inefficient (and <a href="#final">here's an example</a>), but often leads to bugs when memory ranges are suddenly greater than 2GB. <pre> Unsigned Signed HS (CS) GE HI GT LS LE LO (CC) LT </pre> PL and MI directly test bit 31 of the result (ie the N flag is a copy of bit 31). They are subtly different to GE and LT and I'd like someone to give me a case where they should be used instead of GE or LT. <h1> Don't misuse the stack </h1> Obviously, don't stack registers that aren't required to be preserved - which these are depends on whether you are writing APCS conforming code, or whether you have your own Procedure Call Standard, or whether you work on an ad hoc basis. But it's not necessary to stack r14 if the function is a leaf function (calls no other functions). However, if you do find yourself in need of more registers than you have available, then r14 should be the first register you stack on the grounds of efficiency. Consider the two functions: <pre> STMFD r13!, {r14} MUL r0, r1, r2 LDMFD r13!, {pc} STMFD r13!, {r1, r2} LDR r1, [r12, #8] LDR r2, [r12, #12] MUL r0, r1, r2 LDMFD r13!, {r1, r2} MOVS pc, r14 </pre> Better would be: <pre> MUL r0, r1, r2 MOVS pc, r14 STMFD r13!, {r14} LDR r14, [r12, #8] LDR r0, [r12, #12] MUL r0, r14, r0 LDMFD r13!, {pc} </pre> In the second function, restoring registers is combined with returning from the procedure call. Since r14 must be corrupted by the function call, the caller is not expecting its value to be any particular value at the end of the function. This has been exploited in at least one program which I know of to return function values in r14, but you really need to be aware of exactly what you're doing with that technique, and I've never dared try it myself. <p> A further optimisation which can be used is to use a single store rather than a multiple store if you're only storing r14. The above code then becomes: <pre> STR r14, [r13, #-4]! LDR r14, [r12, #8] LDR r0, [r12, #12] MUL r0, r14, r0 LDR pc, [r13], #4 </pre> Beware that you can't tell LDR to restore the program counter flags. This is not an issue if you are using 32-bit APCS-3, but most routines on 26-bit ARMs preserve the flags over a function call. <p> It can sometimes be a win to only stack registers conditionally. Here's an example: <pre> LDR r2, [r12, #0] TEQ r0, r2 MOVEQ pc, r14 STMFD r13!, {r14} BL func SUB r0, r0, #1 LDMFD r13!, {pc} </pre> This can be more efficient if r0 is often equal to r2; for example if this were an error check. <p> Another way in which the stack can be abused is to store data _below_ the stack pointer. This will appear to work but you will get random crashes. This is because interrupt code is entitled to use space on the stack temporarily; if you have left the stack pointer in the wrong place, you will have problems. <h1> References to data within a big program </h1> When your program gets to more than 4k, you start to run the risk of not being able to address data. This is because LDR has a maximum range of +/-4095 bytes. ADR (which is a pseudoinstruction for ADD rN, pc, x, which takes into account the fact that the program counter is 2 instructions ahead of the currently executing instruction) uses the normal 8 bits rotated by a multiple of 2 scheme, so this can get you further away but at the cost of not being able to directly access every address. Many people have FNadrl macros and at least two assemblers that I know of have ADRL instructions. They are implemented as something like: <pre> ADR rN, (offset AND &FF) ADD rN, ((offset - P%)AND &FF00) </pre> with slight modifications to cope with a negative offset, and the cunning ones actually try various possibilities such as <pre> ADR rN, (offset AND &3FC) ADD rN, ((offset - P%) AND &3FC00) </pre> to gain them more distance. <p> Nevertheless, these pseudo-instructions take 2 instructions, and ADRX (for constants which can't be reached with ADRL) takes 3. It is better to avoid using these if possible. Try moving data around so that it's nearer the functions which reference it. If that's not possible, try moving functions around. <p> If you have data which are used throughout the program, consider allocating a register to point to their address throughout the execution of the program. In RISC OS modules, this is assisted by Acorn by pointing r12 at a block of private workspace. <p> Under no circumstances consider doing the following: <pre> BL getvar ... .getvar LDR r0, var MOVS pc, r14 .var EQUD 0 </pre> since this involves two pipeline flushes and wrecks most caching strategies so it is guaranteed to be slower. It also goes against the principle of keeping data and code separate which is more important on StrongARM processors with their split instruction/data caches. <h1> Working with large constants </h1> By large constants, I don't mean constants with a large magnitude like 2^31, I mean constants which don't fit in the ARM's scheme for immediate constants; ie 8 bits rotated by a multiple of 2. This example is from IscaFS: <pre> LDR r0, [r6, #24] BIC r0, r0, #&ff000000 BIC r0, r0, #&00ff0000 MOV r1, r1, #&53 ORR r1, r1, #&ef00 TEQ r0, r1 </pre> I replaced this with: <pre> LDR r0, [r6, #24] MOV r0, r0, LSL #16 EOR r0, r0, #&ef000000 TEQ r0, #&00530000 </pre> Shifting r0 left by 16 automatically zeroes the leftmost 16 bits and discards the previous top 16 bits, which is the desired effect. The crucial step is noticing that TEQ is the same as EORS, without a destination register. So if the top byte of r0 is not &ef then the second test can never be true. This trick saves 2 instructions and one register. <p> A similar technique can apply to other situations, for example checking that a 16 bit value is less than a given value can be done by <pre> CMP r0, #&xy00 CMPEQ r0, #&00za </pre> is the same as <pre> MOV r1, #&xy00 ORR r1, #&00za CMP r0, r1 </pre> but takes one instruction fewer and uses one register fewer. <p> If it's utterly unavoidable, you may need to put a large constant into a register. It is thought that there are no constants which cannot be produced in 3 instructions, though no-one has an algorithm for producing arbitrary constants (to the best of my knowledge). On a StrongARM, it is normally quicker to load an instruction instead of using 3 to synthesise it. On an ARM6, it is quicker to use 3 instructions. Take your pick. <h1> Strength reduction </h1> This term covers a number of optimisations. One is to reduce the cost of per-iteration calculations. <pre> MOV r0, #200 MOV r4, #8 .loop MUL r1, r0, r4 ... SUBS r0, r0, #1 BNE loop </pre> becomes <pre> MOV r1, #1600 .loop ... SUBS r1, r1, #8 BNE loop </pre> This saves a MUL instruction in the loop, which is a slow operation on many variants of the ARM. In this case, it also saves 2 registers, though in practice this is not often achieved. <h1> Count down, not up </h1> The above example also illustrates that it's better to count down in a loop than up. Here's a worse example: <pre> MOV r1, #0 .loop ... ADD r1, r1, #8 CMP r1, #1600 BNE loop </pre> The SUBS used above combines the test-for-end with the loop-count, saving an instruction. <h1> Unrolling loops </h1> Since the instructions used per-loop are not doing useful work, it is often better to unroll the loop a little. This can add extra overhead in places, so you should be cautious. It may also pay to unroll the loop entirely, as the C compiler may sometimes do with memset. To reuse the example above: <pre> MOV r1, #1600 .loop ... SUB r1, r1, #8 ... SUBS r1, r1, #8 BNE loop </pre> In this example, since I haven't specified what is in `...', it's not possible to combine the first SUB in with it which might be possible in real code. This optimistaion saves half an instruction (and a cache line flush) per loop - 800 in total. Against that, it takes more cache lines, and more RAM in general. The only way to find if this is a win or not is to benchmark the code in question. <h1> Dividing </h1> There are already several good algorithms out there and I don't think I have anything to contribute myself. Please see <a href="http://www.comlab.ox.ac.uk/oucl/users/robin.watts/Docs/Arc/Programmin/div-fast.html"> This page</a> and <a href="http://www.comlab.ox.ac.uk/oucl/users/robin.watts/Docs/Arc/Programmin/div-short.html"> this one</a> for good examples. <h1> Some further examples </h1> That's about all the coding tricks I can remember for the moment. If you want to see a real example of code which I've improved, try looking at IscaFS which was originally written by Phil Norman. <p> I'll leave you with some more fictional examples: <a name="final"></a> <pre> STMFD r13!, {r14} CMP r0, #5 BGT not CMP r0, #0 BLT not LDR r0, [r12, #8] B over .not MOV r0, #0 SUB r0, r0, #1 .over LDMFD r13!, {pc} </pre> Can be optimised to: <pre> CMP r0, #5 LDRLS r0, [r12, #8] MVNHI r0, #0 MOV pc, r14 </pre> This example demonstrates several points: <ol> <li> Use of unsigned instead of signed comparisons <li> Use of conditionals to avoid branches <li> Not storing the link register if avoidable <li> Remembering one of the more `esoteric' instructions </ol> Thanks for reading, I appreciate feedback and if you have any tricks you'd like to share with the ARM programming community at large then please send them to me. <p> You probably want to look at these pages for further tips: <a href="http://www.comlab.ox.ac.uk/oucl/users/robin.watts/Docs/">Robin Watts' ARM programming page</a> <p> I'd like to thank Phil Norman and Peter Burwood for their help. <hr> <i><a href="mailto:willy@bofh.ai">Matthew Wilcox</a></i> </body> </html>